Order-Aware ETL Workflows
Authors
Abstract
Tziovara, Vasiliki A. MSc, Computer Science Department, University of Ioannina, Greece. October 2006. Order-Aware ETL Workflows. Thesis Supervisor: Panos Vassiliadis.

Data Warehouses are collections of data coming from different sources, used mostly to support decision making and data analysis in an organization. To populate a data warehouse with up-to-date records extracted from the sources, special tools are employed, called Extraction-Transform-Load (ETL) tools, which organize the steps of the whole process as a workflow. An ETL workflow can be considered a directed acyclic graph (DAG) that captures the flow of data from the sources to the data warehouse. The nodes of the graph are either activities, which apply transformations or cleansing procedures to the data, or recordsets, which are used for storage purposes. The edges of the graph are input/output relationships between the nodes. The workflow is an abstract design at the logical level, which has to be implemented physically, i.e., mapped to a combination of executable programs/scripts that perform the ETL process. Each activity of the workflow can be implemented physically using various algorithmic methods, each with a different cost in terms of time requirements or system resources (e.g., memory or disk space).

The objective of this work is to identify the best possible implementation of a logical ETL workflow. For this reason, we employ (a) a library of templates for the activities and (b) a set of mappings between logical and physical templates. First, we use a simple cost model that designates as optimal the scenario with the best expected execution speed. We model the problem as a state-space search problem and propose an exhaustive algorithm for state generation that discovers the optimal physical implementation of the scenario. To this end, we propose a cost model as a discrimination criterion between physical representations, one that also works for black-box activities with unknown semantics. We also study the effects of possible system failures on the workflow operation. The difficulty in this case lies in the...
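To make the search procedure described above concrete, the following is a minimal sketch of an exhaustive state-space search over physical implementations of a logical ETL workflow. It is not the thesis implementation: all class and function names are illustrative, and the cost model is simplified to a plain sum of per-template costs, whereas the actual cost model and the order-aware aspects of the workflow are more involved.

```python
# Illustrative sketch: pick one physical template per logical activity so that
# the total expected cost is minimized, by enumerating every combination.
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PhysicalTemplate:
    name: str    # e.g. "nested-loops join", "sort-merge join" (hypothetical)
    cost: float  # expected execution cost under the (simplified) cost model


@dataclass(frozen=True)
class LogicalActivity:
    name: str                                  # e.g. "filter", "join", "load"
    candidates: Tuple[PhysicalTemplate, ...]   # physical templates it maps to


def exhaustive_search(
    activities: List[LogicalActivity],
) -> Tuple[Dict[str, PhysicalTemplate], float]:
    """Enumerate all assignments of physical templates to logical activities
    and return the assignment with the minimum total cost."""
    best_state, best_cost = {}, float("inf")
    for combo in product(*(a.candidates for a in activities)):
        cost = sum(t.cost for t in combo)  # simplified additive cost model
        if cost < best_cost:
            best_state = {a.name: t for a, t in zip(activities, combo)}
            best_cost = cost
    return best_state, best_cost


if __name__ == "__main__":
    # Hypothetical three-activity workflow with alternative physical templates.
    workflow = [
        LogicalActivity("filter", (PhysicalTemplate("scan-filter", 3.0),)),
        LogicalActivity("join", (PhysicalTemplate("nested-loops", 12.0),
                                 PhysicalTemplate("sort-merge", 7.5))),
        LogicalActivity("load", (PhysicalTemplate("bulk-load", 2.0),
                                 PhysicalTemplate("row-insert", 9.0))),
    ]
    state, cost = exhaustive_search(workflow)
    print({name: t.name for name, t in state.items()}, cost)
```

The exhaustive enumeration mirrors the "state generation" step of the approach: each state is one complete mapping of logical activities to physical templates, and the cost model acts as the discrimination criterion between states.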
Similar resources
Benchmarking ETL Workflows
Extraction–Transform–Load (ETL) processes comprise complex data workflows, which are responsible for the maintenance of a Data Warehouse. A plethora of ETL tools is currently available, constituting a multi-million dollar market. Each ETL tool uses its own technique for the design and implementation of an ETL workflow, making the task of assessing ETL tools extremely difficult. In this paper, we...
Towards a Benchmark for ETL Workflows
Extraction–Transform–Load (ETL) processes comprise complex data workflows, which are responsible for the maintenance of a Data Warehouse. Their practical importance is denoted by the fact that a plethora of ETL tools currently constitutes a multi-million dollar market. However, each one of them follows a different design and modeling technique and internal language. So far, the research commun...
Systematic ETL management - Experiences with high-level operators
Large organizations load much of their data into data warehouses for subsequent querying, analysis, and data mining. Extract-Transform-Load (ETL) workflows populate those data warehouses with data from various data sources by specifying and executing a set of transformations forming a directed acyclic transformation graph (DAG). Over time, hundreds of individual ETL workflows evolve as new sour...
Determining Essential Statistics for Cost Based Optimization of an ETL Workflow
Many of the ETL products in the market today provide tools for the design of ETL workflows, with very little or no support for optimization of such workflows. Optimization of ETL workflows poses several new challenges compared to traditional query optimization in database systems. There have been many attempts both in the industry and the research community to support cost-based optimization techniq...
Blueprints for ETL workflows
Extract-Transform-Load (ETL) workflows are data-centric workflows responsible for transferring, cleaning, and loading data from their respective sources to the warehouse. Previous research has identified graph-based techniques that construct the blueprints for the structure of such workflows. In this paper, we extend existing results by explicitly incorporating the internal semantics of each act...